Exploratory Data Analysis on US Real Estate¶

estate

INTRODUCTION¶

The United States is the largest economy in the world. Based on realtrends article, real estate accounts for 16.9% of US's GDP on 2021 which is considered a major GDP contributor. The following analysis aims to provide viewers with real estate insights that may facilitate general understanding and potential investment decisions through market trends and factors influencing the real estate market.

In [36]:
# Importing Libraries 
import pandas as pd 
import numpy as np
import seaborn as sns 
import statsmodels.api as sm
import matplotlib.pyplot as plt
from IPython.display import HTML

HTML('''
    <script>
        code_show=true; 
        function code_toggle() {
         if (code_show){
         $('div.input').hide();
         } else {
         $('div.input').show();
         }
         code_show = !code_show
        } 
        $( document ).ready(code_toggle);
    </script>
    <form action="javascript:code_toggle()">
        <input type="submit" value="Toggle Code">
    </form>
''')
Out[36]:
In [2]:
# Importing data 
raw_real_estate = pd.read_csv("realtor_data.csv")

DATA CLEANING AND VALIDATION¶

This stage contains initial exploration of the real estate dataset. It involves identifying missing values, reformating certain columns for analysis purpose, filtering out duplicated datas and keep thedata required for analysis purpose.

In [3]:
# Initial Exploration
# Identifying the shape of the dataset 
shape = raw_real_estate.shape
print(shape)
# identifying the data types of all the columns
data_types = raw_real_estate.dtypes
(923159, 12)
In [4]:
# Identifying the number of missing values in each the columns 
missing_values = raw_real_estate.isna().sum().reset_index()
missing_values.columns = ['Column_Names', 'Missing_value_Count']
display(missing_values)
Column_Names Missing_value_Count
0 status 0
1 price 71
2 bed 131703
3 bath 115192
4 acre_lot 273623
5 full_address 0
6 street 2138
7 city 74
8 state 0
9 zip_code 205
10 house_size 297843
11 sold_date 466763
In [5]:
# Changing the date column into "date" data type 
raw_real_estate['sold_date'] = pd.to_datetime(raw_real_estate['sold_date'], format = '%Y-%m-%d')
In [6]:
# Removing rows with missing values in certain columns
raw_real_estate = raw_real_estate.dropna(subset = ['price', 'bed', 'bath', 'acre_lot', 'zip_code', 'house_size', 'sold_date'])
In [7]:
## Filtering for datas for years between year 2000 and 2022 (ordered)

# Finding the max and min dates in the dataset 
raw_real_estate['sold_date'].agg(['min', 'max'])

# Filtering the date from the range of dates in order
raw_real_estate = raw_real_estate[(raw_real_estate['sold_date'] >= '2000-01-01')
                     & (raw_real_estate['sold_date'] <= '2022-12-31')].sort_values("sold_date")

# Filtering the columns needed for further analysis
final_real_estate = raw_real_estate.loc[:, (raw_real_estate.columns != 'street') ]

# Removing Duplicate Values 
final_real_estate = final_real_estate.drop_duplicates(subset = ["full_address", "sold_date"])

# Reformating the price column 
final_real_estate['price'] = final_real_estate['price']/1000

#  Create the column called "year"
final_real_estate['year'] = pd.DatetimeIndex(final_real_estate['sold_date']).year

# Renaming the price column
final_real_estate.rename(columns = {"price": "Sale_Price_Thousands"}, inplace = True) 
In [8]:
# Adding a new column called Price per Squre Foot (Value for Money)
final_real_estate['price_per_sqft'] = (final_real_estate['Sale_Price_Thousands']/final_real_estate['house_size']) * 1000
In [9]:
# Adding a new column called room ratio (feeling of balance of the house)
final_real_estate['room_ratio'] = final_real_estate['bed']/final_real_estate['bath']
In [31]:
# Downloading the new code
final_real_estate.to_csv("final_real_estate.csv", index = False)
In [10]:
display(final_real_estate.head(10))
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
920877 for_sale 225.0 3.0 2.0 1.18 21 Morris Dr, Newburgh, NY, 12550 Newburgh New York 12550.0 1245.0 2000-01-03 2000 180.722892 1.500000
634864 for_sale 451.0 4.0 3.0 0.29 16 Underwood Ct, Burlington, NJ, 08016 Burlington New Jersey 8016.0 2085.0 2000-01-04 2000 216.306954 1.333333
627427 for_sale 289.0 1.0 1.0 0.13 113 Cleveland Ln, Rockaway, NJ, 07866 Rockaway New Jersey 7866.0 915.0 2000-01-04 2000 315.846995 1.000000
229776 for_sale 350.0 3.0 2.0 0.15 31 Benefit St, Pawtucket, RI, 02861 Pawtucket Rhode Island 2861.0 1500.0 2000-01-04 2000 233.333333 1.500000
448440 for_sale 365.0 2.0 1.0 0.95 49 Tinker Hill Rd, Washington, CT, 06777 Washington Connecticut 6777.0 720.0 2000-01-04 2000 506.944444 2.000000
590524 for_sale 949.9 5.0 5.0 5.04 5 Country Ln, Tewksbury Township, NJ, 07830 Tewksbury Township New Jersey 7830.0 4508.0 2000-01-04 2000 210.714286 1.000000
778538 for_sale 695.0 6.0 2.0 0.07 1324 Star Ave, Elmont, NY, 11003 Elmont New York 11003.0 1770.0 2000-01-04 2000 392.655367 3.000000
612420 for_sale 85.0 3.0 1.0 0.02 1126 Chestnut St, Wilmington, DE, 19805 Wilmington Delaware 19805.0 1075.0 2000-01-04 2000 79.069767 3.000000
449429 for_sale 349.0 3.0 2.0 0.18 11 Dartmouth Ln, Danbury, CT, 06810 Danbury Connecticut 6810.0 1170.0 2000-01-05 2000 298.290598 1.500000
686596 for_sale 154.9 4.0 1.0 0.22 216 South Ave, Bridgeton, NJ, 08302 Bridgeton New Jersey 8302.0 1008.0 2000-01-05 2000 153.670635 4.000000

KEY ANALYSIS STAGE¶

Overall Market Trend and Analysis¶

Identifying total real estate sales volumes from year 2000 to 2022¶

A usual business cycle can be identified in the time series plot below where the peaks occured at 2005 and 2017 and the downturn at 2000, 2011 and 2022. The downturn occured at 2011 maybe considered weak recoveries from the 2008 global financial crisis and 2022 downturn maybe caused by the covid pandemic and the rise of interest rates.

In [11]:
# Finding the sum of real estate sales Volume in each of the year 
General_Sales_Volume = final_real_estate['year'].value_counts().sort_index()

# Plotting the Real Estate Sales Volume over the years 
Sales_Volume_Plot = General_Sales_Volume.plot(title = "Total Real Estate Sales Volume from 2000 to 2022", 
                                              xlabel = "year", ylabel = "Sales_Volume", kind = "line")

Identifying total real estate sales value from year 2000 to 2022¶

The real estate sales value time series plot below shows a similar pattern as the sales volume time series plot. This may be due to the large value each real estate carries which makes changes in real estate sales very obvious.

In [12]:
# Finding the sum of real estate sales Value in each of the year 
General_Sales_Values = pd.DataFrame(final_real_estate.groupby('year')['Sale_Price_Thousands'].sum())
Sales_Revenue_Plot = General_Sales_Values.plot(title = "Total Real Estate Sales Values from 2000 to 2022", 
                                               ylabel = "Sales_Price_Thousands", kind = "line", legend = False)

Which State have the highest median property price?¶

Currently Virgin islands is a state with the highest median property price however, there were only datas on 2 properties listed on Virgin Island and one of the property price was heavily skewed thus creating an outlier. Therefore, its not a good indicator of a median prices in Virgin Islands and thus will be removed from the chart for more accurate analysis.

In [13]:
# find the median house prices in each state
median_prices = final_real_estate.groupby("state")['Sale_Price_Thousands'].median().reset_index().sort_values(by='Sale_Price_Thousands', ascending=False)
median_prices_except_highest = median_prices.iloc[1:]

plt.figure(figsize=(12, 6))
sns.barplot(data=median_prices_except_highest, x='state', y='Sale_Price_Thousands', palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("State")
plt.ylabel("Median Property Price (Thousands $)")
plt.title("Median Property Price by State")
plt.show()
In [14]:
# identifying all the properties from Virgin Islands
virgin_islands_rows = final_real_estate[final_real_estate["state"] == "Virgin Islands"]
display(virgin_islands_rows)
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
11371 for_sale 950.0 5.0 4.0 0.99 46 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 5000.0 2013-10-11 2013 190.000000 1.250000
10351 for_sale 6899.0 4.0 6.0 0.83 10 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 4600.0 2018-04-05 2018 1499.782609 0.666667

Identifying the Sales Volume of the top 5 states¶

New Jersey has the highest Real Estate sales volume across 22 years while New York is slightly ahead of Pennsylvania.

In [15]:
#2. Identifying the sales volume based on states 

# Identifying the number of states available in the US
final_real_estate['state'].nunique()

# Identifying the top 5 states in terms of real estate sales volume
final_real_estate['state'].value_counts().sort_values().tail(5).plot(kind = 'barh', 
                                                                     color = ['Green', 'Blue', 'Orange', 'Brown', 'Purple'])
plt.ylabel("States")
plt.xlabel("Real Estate Sales Volume")
plt.title("Top 5 US States Real Estate Sales Volume")
Out[15]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Volume')

Identifying the sales volume of the top 5 US states over the years¶

New Jersey has the highest Real Estate sales volume each year for 22 years while the New York state has fallen behind Pennsylvania in recent years. Furthermore, Massachusetts have the least variation and lowest real estate sales for the past 12 years. Overall the real estate sales volume in the top 5 states follows a similar trend over the past 22 years.

In [16]:
# Identifying the real estate sales volume in the top 5 states over the years 

# Filter out datas that only contains the top 5 states in terms of volume 
top_5_states = final_real_estate[final_real_estate.state.isin(["New Jersey", "Connecticut", "New York", "Pennsylvania", "Massachusetts"])]
top_5_states_sales_volume = pd.DataFrame(top_5_states.groupby(['year', 'state']).size()).rename(columns = {0: "State_Count"})

# Plotting the 
Palette = ["Brown", "Green", "Purple", "Orange", "Blue"]
sns.set_palette(Palette)
top_5_states_sales_volume = sns.relplot(x = "year", y = "State_Count", 
                                        data = top_5_states_sales_volume, kind = "line", hue = "state")
top_5_states_sales_volume.fig.suptitle("Top 5 States Real Estate Sales Volume from year 2000 to 2022", y = 1)
Out[16]:
Text(0.5, 1, 'Top 5 States Real Estate Sales Volume from year 2000 to 2022')

Identifying the sales value of the top 5 US states¶

The top 5 US states for sales volume and sales value are the same, however, the New York state has taken the first position having the most real estate sales value for the past 22 years.

In [17]:
#2b Identifying the top 5 states based Sale_Price_Thousandson real estate sales value
final_real_estate.groupby('state')['Sale_Price_Thousands'].sum().sort_values().tail(5).plot(kind = "barh", 
                                                color = ['Blue', 'Green', 'Brown', 'Purple', "Orange"])
plt.xlabel("Real Estate Sales Value")
plt.ylabel("States")
plt.title("Top 5 US States Real Estate Sales Value")
Out[17]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Value')

Median Prices Per Sqft Across the States¶

From the Prices per Sqft analysis, it shows only New York and Massachusetts have median price per sqft that is above 300 dollars while other states have quite similar median price per sqft. This would provide some insights of value for money and the prices of properties in each states, however it is important to note that this metric does not consider qualitative factors such as property layout and design and the benefits specific locations bring.

In [18]:
filtered_data = final_real_estate[~final_real_estate['state'].isin(['Puerto Rico', 'Virgin Islands'])]
state_pricespersqft = filtered_data.groupby('state')['price_per_sqft'].median().sort_values()

# generate unique colours for each state using seaborn
colors = sns.color_palette('tab20', len(state_pricespersqft))

state_pricespersqft.plot(kind = 'barh', color = colors)


plt.xlabel("Prices per Sqft")
plt.ylabel("States")
plt.title("Prices per Sqft Across the US States")
Out[18]:
Text(0.5, 1.0, 'Prices per Sqft Across the US States')
In [19]:
state_counts = final_real_estate['state'].value_counts()
display(state_counts)
New Jersey        7348
Connecticut       5113
New York          4240
Pennsylvania      3999
Massachusetts     1920
Rhode Island      1152
New Hampshire      682
Delaware           602
Vermont            432
Maine              363
Virgin Islands       2
Puerto Rico          1
Name: state, dtype: int64

Identify the most expensive real estate sold in each year¶

In [20]:
#3 Identify the Information of the Most expensive real estate in each year.
Most_Expensive_Real_Estate = top_5_states.loc[top_5_states.groupby('year')['Sale_Price_Thousands'].idxmax()]
display(Most_Expensive_Real_Estate)
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
922453 for_sale 39000.0 8.0 13.0 8.23 555 Lake Ave, Greenwich, CT, 06830 Greenwich Connecticut 6830.0 18954.0 2000-07-17 2000 2057.613169 0.615385
649077 for_sale 13950.0 10.0 14.0 16.93 770 Godfrey Rd, Villanova, PA, 19085 Villanova Pennsylvania 19085.0 19000.0 2001-03-07 2001 734.210526 0.714286
560341 for_sale 32500.0 8.0 8.0 0.05 20 E 73rd St, New York, NY, 10021 New York New York 10021.0 11371.0 2002-03-21 2002 2858.147920 1.000000
409975 for_sale 60000.0 9.0 10.0 400.00 450 Brickyard Rd, Woodstock, CT, 06281 Woodstock Connecticut 6281.0 18777.0 2003-08-28 2003 3195.398626 0.900000
495469 for_sale 25000.0 6.0 11.0 0.07 70 Broad St, New York City, NY, 10004 New York City New York 10004.0 19478.0 2004-07-12 2004 1283.499333 0.545455
204657 for_sale 15000.0 6.0 6.0 4.50 2 Squibnocket Rd, Chilmark, MA, 02535 Chilmark Massachusetts 2535.0 4706.0 2005-10-04 2005 3187.420314 1.000000
734864 for_sale 18500.0 3.0 5.0 1.94 151 E 58th St Ph 55W, New York City, NY, 10022 New York City New York 10022.0 4500.0 2006-07-11 2006 4111.111111 0.600000
743986 for_sale 20500.0 4.0 5.0 1.33 New York City, NY, 10023 New York City New York 10023.0 3333.0 2007-12-20 2007 6150.615062 0.800000
743290 for_sale 50000.0 6.0 8.0 0.22 995 5th Ave Unit Ph, New York City, NY, 10028 New York City New York 10028.0 6891.0 2008-09-12 2008 7255.840952 0.750000
64850 for_sale 15000.0 8.0 10.0 157.00 47 White Bridge Rd, Chatham, NY, 12136 Chatham New York 12136.0 14058.0 2009-10-20 2009 1067.008109 0.800000
917730 for_sale 10600.0 8.0 11.0 4.02 488 West Rd, New Canaan, CT, 06840 New Canaan Connecticut 6840.0 11923.0 2010-07-12 2010 889.037994 0.727273
743968 for_sale 17950.0 7.0 8.0 0.23 535 W End Ave Unit 12TH, New York City, NY, 10024 New York City New York 10024.0 8451.0 2011-01-05 2011 2124.008993 0.875000
917641 for_sale 16000.0 7.0 9.0 47.16 187 Umpawaug Rd, Redding, CT, 06896 Redding Connecticut 6896.0 7590.0 2012-11-28 2012 2108.036891 0.777778
743414 for_sale 45000.0 4.0 10.0 0.05 12 E 82nd St, New York City, NY, 10028 New York City New York 10028.0 12000.0 2013-02-05 2013 3750.000000 0.400000
463153 for_sale 32000.0 6.0 12.0 10.00 16 Hurlingham Dr, Greenwich, CT, 06831 Greenwich Connecticut 6831.0 23700.0 2014-02-21 2014 1350.210970 0.500000
898460 for_sale 19000.0 4.0 5.0 0.05 18 W 11th St, New York City, NY, 10011 New York City New York 10011.0 6000.0 2015-06-09 2015 3166.666667 0.800000
781352 for_sale 18250.0 5.0 6.0 0.05 116 E 65th St, New York City, NY, 10065 New York City New York 10065.0 6950.0 2016-04-28 2016 2625.899281 0.833333
742416 for_sale 16000.0 7.0 12.0 0.05 224 W 22nd St, New York City, NY, 10011 New York City New York 10011.0 7640.0 2017-01-20 2017 2094.240838 0.583333
751525 for_sale 12995.0 4.0 4.0 0.02 40 Leroy St, New York City, NY, 10014 New York City New York 10014.0 3650.0 2018-07-27 2018 3560.273973 1.000000
640427 for_sale 29900.0 5.0 3.0 209.50 944 Providence Rd, Newtown Square, PA, 19073 Newtown Square Pennsylvania 19073.0 2900.0 2019-02-07 2019 10310.344828 1.666667
461367 for_sale 13650.0 7.0 10.0 2.53 32 Grahampton Ln, Greenwich, CT, 06830 Greenwich Connecticut 6830.0 10534.0 2020-06-26 2020 1295.804063 0.700000
656439 for_sale 12500.0 5.0 6.0 0.66 5499 Dune Dr, Avalon, NJ, 08202 Avalon New Jersey 8202.0 4462.0 2021-01-11 2021 2801.434334 0.833333
678035 for_sale 11500.0 9.0 13.0 0.85 4816 5th Ave, Avalon, NJ, 08202 Avalon New Jersey 8202.0 10000.0 2022-12-31 2022 1150.000000 0.692308

At which states are the most expensive real estate sold located in?¶

Out of 22 years, the New York state sold the most expensive real estate in the US for 12 years, outbeating other states by at least 50%. Moreover, this may mean that real estates in New York state tends to appreciate more and/or has higher value due to higher living standards.

In [21]:
#3b At which states are the most expensive real estates sold located in? 
plot = Most_Expensive_Real_Estate['state'].value_counts().plot(kind = "bar", 
                                                                color = ['Orange', 'Brown', 'Blue', 'Purple', "Green"])

plt.xticks(rotation = 0)
plt.xlabel("State")
plt.ylabel("Most Expensive Real Estate Count")
plt.title("Most Expensive Real Estate Count in the Top 5 States")
Out[21]:
Text(0.5, 1.0, 'Most Expensive Real Estate Count in the Top 5 States')

Relationship Analysis¶

Identifying the relationships between number of bathrooms, bedrooms, acre_lot and house size to the sale price.¶

From the correlation matrix shown below, turns out number of bathrooms has the strongest positive linear relationship to the sale price compared to other variables. It may be unexpected that the size of the real estate is not the variable with the strongest linear relationship. However, its important to point out that there are other factors that influences real estate prices such as the environment, location choice and the ability of the property to demand higher rents which is influenced by the purpose of using the land.

In [22]:
# Selecting columns for correlation calculation
columns = ['Sale_Price_Thousands', 'bath', 'bed', 'acre_lot', 'house_size']
subset = top_5_states[columns]

display(subset.corr())
Sale_Price_Thousands bath bed acre_lot house_size
Sale_Price_Thousands 1.000000 0.554682 0.296698 0.008957 0.431435
bath 0.554682 1.000000 0.618261 0.007481 0.600317
bed 0.296698 0.618261 1.000000 -0.000537 0.465522
acre_lot 0.008957 0.007481 -0.000537 1.000000 0.004948
house_size 0.431435 0.600317 0.465522 0.004948 1.000000

House Size Vs House Price¶

The the correlation between House Size and House Price is 0.4314. Which shows that House size does not have a strong positive relationship with the housing prices

In [23]:
correlation = final_real_estate[['house_size', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='house_size', y='Sale_Price_Thousands', alpha=0.4)
plt.title("House Size vs. Price")
plt.xlabel("House Size (sqft)")
plt.ylabel("Price ($)")
plt.show()

Number of Bathrooms vs Price¶

The number of bathrom has the strongest positive relationship with the housing price based on the correlation matrix (0.55), however it is important to note that there are other factors that would contribute to housing prices such as the population density of the state and other aspects of the property itself.

In [24]:
correlation = final_real_estate[['bath', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='bath', y='Sale_Price_Thousands', alpha=0.4)
plt.title("Number of Bathrooms vs. Price")
plt.xlabel("Number of Bathrooms")
plt.ylabel("Price ($)")
plt.show()

What type of house is the most popular over the years? (Based on the number of bathrooms and bedrooms)¶

Based on the table below, a 3 bedroom and 2 bathroom property is the most popular real estate in terms of sales volume every year for the past 22 years. This maybe due to the fact that this is the average family size in the US which drives the most demand for this type of property structure. Moreover, the spike of property sales during 2021 may be because of the covid 19 pandemic which may have affected the financial wellbeing of alot of people, which led to the sale of property. Moreover, the drop in property sales on 2022 may be due to insufficient data collected for the year.

In [35]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx = raw_top_house.groupby(['year'])['counts'].transform(max) == raw_top_house['counts']
# Will only return rows that matches True
popular_property = raw_top_house[idx]


import plotly.express as px

# Create the line chart
fig = px.line(
    popular_property,
    x='year',
    y='counts',
    markers=True,
    title= "Sales Trend for the Most demanded property over the years (3 bedroom and 2 bathrooms)",
    labels={'year': 'Year', 'counts': 'Number of Listings'}
)


# Customize hover tooltip (optional)
fig.update_traces(hovertemplate='Year: %{x}<br>Listings: %{y}')


# Show the interactive chart
fig.show(renderer = "notebook")

What are the least popular property in each of the year?¶

From the results, property with unbalanced bathroom to bedroom ratio property or properties that are much larger in size (bedroom and bathroom numbers) tend to be less popular due to the average household size which makes such properties harder to sell.

In [37]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx_min = raw_top_house.groupby(['year'])['counts'].transform(min) == raw_top_house['counts']
# Will only return rows that matches True

pd.set_option('display.max_rows', 500) 
display(raw_top_house[idx_min].head(10))
year bed bath counts
34 2000 11.0 12.0 1
35 2000 11.0 6.0 1
36 2000 11.0 11.0 1
37 2000 9.0 9.0 1
38 2000 7.0 6.0 1
39 2000 1.0 3.0 1
40 2000 3.0 5.0 1
41 2000 3.0 6.0 1
42 2000 4.0 6.0 1
43 2000 6.0 12.0 1
In [32]:
has_duplicates = final_real_estate.duplicated().any()
In [28]:
# Splitting the dataset by popularity 
#1 Count how many times each (bed, bath) combination appears 

combo_counts = final_real_estate.groupby(['bed', 'bath']).size().reset_index(name = 'Sales Count (Bed and Bath)')

#2 Merge the counts to the original dataset
final_real_estate_counts = final_real_estate.merge(combo_counts, on = ['bed', 'bath'], how = 'left')

#3 use quantiles to define popularity levels 
q_low = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.33)
q_high = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.66)

# Create a new column for popularity
def popularity_label(count):
    if count <= q_low:
        return 'Least Popular'
    elif count <= q_high:
        return 'Average Popularity'
    else:
        return 'Most Popular'
# apply the function to the dataset
final_real_estate_counts['popularity'] = final_real_estate_counts['Sales Count (Bed and Bath)'].apply(popularity_label)        

What makes a property less popular? (Based on bed to bathroom ratio)¶

From the below table, it is showned that majority of the property with room ratio of less than 1 or more than 1.5 are categorised as less popular. Generally speaking less popular property do not have a very good balance of bedroom and bathroom, or property with too much bedrooms and bathroom tend to less attractive as it exceed the average people per household in the US and it may be difficult for investors to resell it due to its size and demand for the property.

In [38]:
least_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Least Popular']

display(least_popular_property.groupby(['bed', 'bath', 'room_ratio']).size().reset_index().head(10))
bed bath room_ratio 0
0 1.0 1.0 1.000000 347
1 1.0 2.0 0.500000 78
2 1.0 3.0 0.333333 4
3 1.0 4.0 0.250000 2
4 2.0 1.0 2.000000 1366
5 2.0 3.0 0.666667 346
6 2.0 4.0 0.500000 25
7 2.0 5.0 0.400000 2
8 3.0 4.0 0.750000 456
9 3.0 5.0 0.600000 56

What makes a property popular? (Based on bed to bathroom ratio)¶

From the analysis, a 3 bedroom and 2-3 bathroom property are the most popular because the demand for a property is heavily driven by the usage of the property, on average a US household consist of 3.15 people, thus property with this size is the most popular.

In [30]:
most_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Most Popular']

# identify the unique room_ratio 
most_popular_property.groupby(['bed', 'bath','room_ratio']).size().reset_index()
Out[30]:
bed bath room_ratio 0
0 3.0 2.0 1.5 5255
1 3.0 3.0 1.0 2846

CONCLUSION¶

The top 5 States for real estate sales in the US are New Jersey, Connecticut, New York, Pennsylvania and Massachusetts for the past 22 years. If this trend continues, the real estates in these states may be potential investment options as the market is quite active, thus more liquid. However, real estates in these states may be potentially more expensive, but at the same time these states may have more employment opportunities. If quality of life such as pollution rates are of concern, other less crowded states maybe better choices for living.

Moreover, real estate values are mainly driven by value generating drivers such as location and the property's ability to generate rent rather than just real estate size. It is important to note that the balance between bathroom and bedroom is very important indicating the liquidity of a property, demands for property is mainly driven by needs of the buyer. If the buying motive is for investment purposes (potential future sale or rent generation), it is important to look into what the target market is demanding.

Furthermore, a 3 bedroom and 2 bathroom real estate is the most demanded real estate for the past 22 years and if this trend continues to the future, this is mainly driven by the average household size in the US. This type of real estate would be a safe and great investment option.